Classifying Molecular Sequences Using a Linkage Graph With Their Pairwise Similarities

Authors

  • Hideo Matsuda
  • T. Ishihara
  • Akihiro Hashimoto
Abstract

This paper presents a method for classifying a large and mixed set of uncharacterized sequences provided by genome projects. As the measure of sequence similarity, we use a similarity score computed by a method based on dynamic programming (DP), such as the Smith–Waterman local alignment algorithm. Although comparison by a DP-based method is very sensitive, when the given sequences include a family of sequences that have diverged greatly in the evolutionary process, the similarity among some of them may be hidden behind spurious similarity to unrelated sequences. Also, the distance derived from the similarity score may not be a metric (i.e., the triangle inequality may not hold) when some sequences have a multi-domain structure. To cope with these problems, we introduce a new graph structure called the p-quasi complete graph for describing a family of sequences with a confidence measure. We prove that a restricted version of the p-quasi complete graph problem (given a positive integer k, whether or not a graph contains a 0.5-quasi complete subgraph of size ≥ k) is NP-complete. Thus we present an approximation algorithm for classifying a set of sequences using p-quasi complete subgraphs. The effectiveness of our method is demonstrated by the result of classifying over 4000 protein sequences on the Escherichia coli genome, which was completely determined recently. © 1999 Elsevier Science B.V. All rights reserved.
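The abstract does not define a p-quasi complete graph, so the following is only a minimal sketch under an assumed definition: a set of n vertices is treated as p-quasi complete when every vertex is adjacent to at least p·(n−1) of the others. The function names, the score threshold, and the toy scores are all hypothetical and are not taken from the paper. The sketch builds a linkage graph from pairwise similarity scores and tests a candidate family against that assumed definition.

```python
def build_linkage_graph(scores, threshold):
    """Build an undirected linkage graph as an adjacency dict.

    scores: dict mapping a pair (a, b) to a similarity score.
    An edge is added when the score reaches the (hypothetical) threshold.
    """
    adj = {}
    for (a, b), score in scores.items():
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if score >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_p_quasi_complete(adj, vertices, p):
    """Assumed definition: every vertex in `vertices` is adjacent to at
    least p * (n - 1) of the other vertices in the same set."""
    members = set(vertices)
    n = len(members)
    if n < 2:
        return True
    required = p * (n - 1)
    return all(len(adj.get(v, set()) & members) >= required for v in members)


# Toy scores (made up): A is similar to both B and C, but B and C are not
# similar to each other, as could happen if A had two domains.
scores = {
    ("A", "B"): 120, ("A", "C"): 95, ("B", "C"): 40,
    ("A", "D"): 10, ("B", "D"): 8, ("C", "D"): 12,
}
graph = build_linkage_graph(scores, threshold=50)

half = is_p_quasi_complete(graph, ["A", "B", "C"], p=0.5)
full = is_p_quasi_complete(graph, ["A", "B", "C"], p=1.0)
with_d = is_p_quasi_complete(graph, ["A", "B", "C", "D"], p=0.5)
```

In this toy case the set {A, B, C} passes at p = 0.5 but fails at p = 1.0 (a complete subgraph), because the B–C edge is missing; adding the unrelated sequence D makes the set fail at p = 0.5 as well. Finding the largest such subgraph is the hard part the abstract refers to, and this sketch does not attempt it.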


Similar resources

A graph-based clustering method for a large set of sequences using a graph partitioning algorithm.

A graph-based clustering method is proposed to cluster protein sequences into families, which automatically improves the clusters of the conventional single linkage clustering method. Our approach formulates the sequence clustering problem as a kind of graph partitioning problem in a weighted linkage graph, in which vertices correspond to sequences and edges correspond to similarities higher than a given thres...


Image Categorization Using Directed Graphs

Most existing graph-based semi-supervised classification methods use pairwise similarities as edge weights of an undirected graph with images as the nodes of the graph. Several recent graph construction methods, however, produce directed graphs (asymmetric similarity between nodes). A simple symmetrization is often used to convert a directed graph to an undirected one. This, however, loses...


Malware Detection using Classification of Variable-Length Sequences

In this paper, a novel graph-based method is proposed to classify sequences of variable length as feature extraction. The proposed method overcomes the problems of the traditional graph with variable-length data, without fixing the length of sequences, by determining the most frequent instructions and inserting the rest of the instructions into the set of “other”, which saves time and memory. Acco...


Detection of Distant Structural Similarities in a Set of Proteins Using a Fast Graph-Based Method

We introduce a method for finding weak structural similarities in a set of protein structures. Proteins are considered at their secondary structure level. The method uses a rigorous graph-theoretical algorithm which finds all structural similarities. Protein structures are modelled as undirected labelled graphs, the so-called protein graphs. We suggest that for detecting the similarities betwee...


Biogeometry Research Faster Multiple Sequence Alignment Algorithms Based on Pairwise Segmentation

Multiple Sequence Alignment (MSA) is a central problem in computational molecular biology: it identifies and quantifies similarities among several protein or DNA sequences. The well-known dynamic programming (DP) algorithms align k sequences (each of length n) by constructing a k-dimensional grid graph of size O(n^k), with each of the sequences enumerating one of the dimensions of the grid. The ...



Journal:
  • Theor. Comput. Sci.

Volume 210

Publication date: 1999